ABSTRACT

Appropriateness indexes (statistical formulas) for
detecting suspiciously high or low scores on aptitude tests were
presented, based on a simulation of the Scholastic Aptitude Test
(SAT) with 3,000 simulated scores: 2,800 normal and 200 suspicious.
The traditional index, marginal probability, uses a model for the
normal examinee's test-taking behavior only, based on item
characteristic curve theory. The other two indices use a
generalization of the traditional index which allows ability to vary
during testing. One uses the standard likelihood ratio to quantify
the amount of improvement of fit achieved by permitting ability to
vary across items. The other index estimates the parameter values of
the varying ability models, and uses estimated parameter values to
indicate the degree of aberrance. Files of candidates with 10%,
20%, and 40% aberrance were generated by modifying item scores of
normal examinees. Results showed that 20% aberrance was surprisingly
well detected for the suspiciously low group on all three indices.
Suspiciously high candidates were even more easily detected. Results
are significant because they suggest that inappropriately scoring
candidates (such as low ability students who cheat or high ability
students who misinterpret instructions) can be detected without
reference to background variables. (CP)
MEASURING THE APPROPRIATENESS OF MULTIPLE-CHOICE TEST SCORES

Abstract

A student may be so atypical and unlike other students that his
aptitude test score fails to be a completely appropriate measure of his
relative ability. We consider the problem of using the student's pattern
of multiple-choice aptitude test answers to decide whether his score is
an appropriate ability measure. Several indicators of appropriateness
are formulated and evaluated with a simulation of the Scholastic Aptitude
Test.

Multiple-choice aptitude test scores are intended to measure the
relative abilities of students, but sometimes they fail. A student can be so
unlike other examinees that his or her test score cannot be regarded as an
appropriate ability measure. Two hypothetical examples are

Example I (Spuriously high score): A low ability examinee
copies answers to several difficult items from a much more
able neighbor.

Example II (Spuriously low score): A very able examinee,
fluent in Spanish, but not yet fluent in English, misunderstands
the wording of several relatively easy questions.

There are, of course, many other possible ways for scores to fail.
We limit ourselves to cases in which a complicating process (e.g.,
selective copying or low English fluency) tends to produce an unusual
proportion of easy items wrong and hard items right. Thus we do not
expect to be able to recognize a high ability cheater who occasionally
copies from another high ability examinee because he will not have many
easy items wrong. Similarly, we do not expect to recognize a low ability,
low fluency examinee.

Our goal is to design a practical method for using patterns of item
scores to detect aberrant candidates. For this purpose we formulate
appropriateness indices: statistics computed from the examinee's item
scores that tend to be low when the test is an inappropriate measure of
the examinee's ability and high otherwise. A very low index value opens
the question of whether the test adequately measures the examinee.
An essential feature of our approach to testing problems is the
use of only the test itself: Appropriateness indices are functions of 
the examinee's item scores. 

In this paper three general types of appropriateness indices are 
formulated. A representative of each type is evaluated using Monte 
Carlo data in which most of the simulated examinees have responded
according to the usual aptitude test model while a few aberrant ones 
have not. 

It will be seen that all our indices perform quite well, at least 
for the test we are now using to evaluate our approach (the Scholastic
Aptitude Test) and the types of aberrance we have considered. More
specifically, suppose 10% of the examinees are aberrant and we consider
a given percentage of the examinees with the most extreme appropriateness
scores. A random rule would yield 10% aberrant examinees and 90% normal in
the extreme group. Using appropriateness indices, we have designed
rules yielding 50% aberrant, 50% normal examinees in the extreme group.

We consider these results important because they suggest that 
examinees for whom a test is not appropriate can be detected without
reference to additional background variables such as race, religion,
gender, parents' occupation, etc. That is, they suggest there is
internal evidence in the examinee's answer sheet indicating whether
he or she approaches the test as do other candidates with the same 
ability. 



THREE TYPES OF APPROPRIATENESS INDICES



In order to present the intuitions supporting our indices we return
to Example I, the hypothetical low ability copier. He has an improbable
pattern of responses for a low ability examinee because he has correctly
answered several hard items. His pattern is also improbable for a high
ability examinee because many easy items are wrong. His irregular
pattern of item scores seems contrary to the customary psychometric 
assumption that ability is constant during testing. In fact his 
irregular response pattern may be much better described by a model in 
which ability is permitted to change somewhat during testing. 

We have been investigating three basic types of indices. The 
reasoning leading to each will be presented now. Later a 
representative of each type will be formulated more precisely and 
evaluated. 

Our simplest index type, marginal probability, uses a model for
the normal examinee's test-taking behavior only. The usual model
(reviewed in the next section) for the Scholastic Aptitude Test (SAT)
specifies the conditional probability of an observed pattern of item
responses, the probability that an examinee randomly chosen from all
the examinees with a given ability produces the observed pattern of
item responses. The marginal probability of a pattern is obtained by
averaging over the distribution of ability in the population of examinees.
The marginal probability of an aberrant examinee's pattern is expected to be
relatively low because it is unlikely that a high ability person misses 
an easy item or a low ability person passes a hard item. 



The other two index types use generalizations of the usual model
that were formulated as mathematically tractable descriptions of the
types of aberrance we are now studying. These models were suggested
by the following reasoning. The aberrant examinee's "complicating process"
leaves as its trace evidence of both low ability (easy items failed) and
high ability (hard items passed). In a sense soon to be made precise, the
aberrant candidate behaves as if his ability were changing throughout the
test. Thus we expect to obtain a much better fit of the aberrant examinee's
data by using a generalization of the test model that allows ability
to vary during testing.

Type II indices (likelihood ratios) use the standard likelihood ratio
technique to quantify the amount of improvement of fit achieved by permitting
ability to vary across items. Thus to compute a type II index both the
usual model and a generalization of the usual model are fitted to the
examinee's data by selecting parameter values that maximize the probability
of the examinee's pattern of item responses. The ratio of the two
probabilities indicates how much better the generalized model fits.

Type III indices (estimated ability variation) are obtained by
estimating the parameter values of the varying ability models and using
the estimated parameter values to indicate the degree of aberrance.
TEST THEORY

The observed pattern of right and wrong answers on a randomly chosen
answer sheet will be treated as the outcome of a two stage experiment.
In the first stage, an examinee with ability $\theta$ is sampled. In the
second stage a sequence of independent dichotomous random variables
$u_1, \ldots, u_i, \ldots, u_n$ is generated. These are the item scores, coded
one for correct and zero for incorrect.

The usual model for the SAT is primarily concerned with the relation
between ability and item score. According to this model the conditional
probability that $u_i$ is one is a continuous, increasing function of
ability, $P_i(\theta)$, called the item characteristic function. The conditional
probability that a randomly selected examinee with ability $\theta$ produces
the pattern of right and wrong answers corresponding to the vector of
item responses $U = \langle u_1, \ldots, u_i, \ldots, u_n \rangle$ is then

(1)  $f(U \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\, [1 - P_i(\theta)]^{1 - u_i}$.

For a discussion of item characteristic curve theory see Birnbaum (1968).

In this work each item characteristic function is assumed to have 
the "logistic" functional form 



(2)  $P_i(\theta) = c_i + (1 - c_i)\, [1 + e^{-a_i(\theta - b_i)}]^{-1}$,

     $0 < a_i$,  $-\infty < b_i < \infty$,  $0 < c_i < 1$.

This functional form is used regularly with multiple-choice aptitude tests.
For evidence supporting its adequacy for the tests and population we wish
to study, see Lord (1968) and Levine and Saxe (1976).
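
As a minimal sketch of formulas (1) and (2), the following Python functions evaluate the logistic item characteristic function and the logarithm of the conditional probability of a response pattern. The function names and the three item parameter triples are illustrative assumptions; they are not the values obtained from the SAT fitting.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(u, theta, items):
    """log f(U | theta) under the standard model, formula (1).

    u     -- list of item scores, 1 = correct, 0 = incorrect
    items -- list of (a, b, c) tuples, one per item
    """
    total = 0.0
    for score, (a, b, c) in zip(u, items):
        p = icc(theta, a, b, c)
        total += math.log(p) if score == 1 else math.log(1.0 - p)
    return total

# Hypothetical item parameters (easy, medium, hard), for illustration only.
ITEMS = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.5, 0.2)]
```

For an examinee of ability 0, `log_likelihood([1, 1, 0], 0.0, ITEMS)` is larger than `log_likelihood([0, 0, 1], 0.0, ITEMS)`: passing the easy items and failing the hard one is the more probable pattern under the standard model.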

This basic model, in which examinees differ only in ability, will
be called the standard model of item characteristic curve theory. Various
generalizations will be used to describe aberrant examinees. The major
one used in this paper is the Gaussian model in which we assume that a
new ability $\theta_i$ is sampled for each item. Thus the probability that
the i-th item is correct becomes $P_i(\theta_i)$ instead of $P_i(\theta)$. In the
Gaussian model, "item abilities" $\theta_i$ are assumed to be independent
normal random variables with mean $\theta_0$ and variance $\sigma^2$.

In the first stage of the standard model, an examinee with ability $\theta$
is sampled. In the first stage of the Gaussian model, on the other hand,
an examinee with "central ability" $\theta_0$ and "ability variance" $\sigma^2$ is
sampled. Thus the Gaussian model can accommodate two kinds of differences
between examinees. The standard model can be seen as the limiting case of
the Gaussian model with the ability variance $\sigma^2$ equal to zero.

The generalization of the conditional probability (1) used to define
the standard model becomes
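(3)  $f(U \mid \theta_0, \sigma^2) = \int \cdots \int \prod_{i=1}^{n} P_i(\theta_i)^{u_i}\, Q_i(\theta_i)^{1 - u_i}\, \sigma^{-1} \phi[(\theta_i - \theta_0)/\sigma]\, d\theta_1 \cdots d\theta_n$

     $= \prod_{i=1}^{n} \int P_i(t)^{u_i}\, Q_i(t)^{1 - u_i}\, \sigma^{-1} \phi[(t - \theta_0)/\sigma]\, dt$

where $Q_i = 1 - P_i$ and $\phi(x)$ is the Gaussian density $(2\pi)^{-1/2} e^{-x^2/2}$.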
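
Because the integral in the Gaussian model factors item by item, each factor is a one-dimensional average of the item characteristic function over a normal density. The sketch below approximates that average with a normalized trapezoid rule; the quadrature scheme, grid size, and function names are assumptions made for illustration and are not the computational method of the paper.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def gaussian_item_prob(a, b, c, theta0, sigma, n_grid=121, width=6.0):
    """P(item correct) when the item ability is N(theta0, sigma^2).

    Normalized trapezoid rule over theta0 +/- width * sigma.
    """
    if sigma <= 0.0:
        return icc(theta0, a, b, c)  # standard model as the limiting case
    step = 2.0 * width / (n_grid - 1)
    num = 0.0
    den = 0.0
    for k in range(n_grid):
        z = -width + k * step            # standardized item ability
        w = math.exp(-0.5 * z * z)       # unnormalized normal weight
        if k == 0 or k == n_grid - 1:
            w *= 0.5                     # trapezoid end points
        num += w * icc(theta0 + sigma * z, a, b, c)
        den += w
    return num / den

def gaussian_log_likelihood(u, theta0, sigma, items):
    """log f(U | theta0, sigma^2) as in formula (3)."""
    total = 0.0
    for score, (a, b, c) in zip(u, items):
        p = gaussian_item_prob(a, b, c, theta0, sigma)
        total += math.log(p) if score == 1 else math.log(1.0 - p)
    return total
```

With a positive ability variance the averaged curve is flatter than the original one, so a pattern with an easy item failed and a hard item passed becomes less improbable than under the standard model.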

In the discussion section we will wish to refer to other generalizations
of the standard model. Like the Gaussian and standard model, each uses
a vector of parameters $\varphi$ to characterize the examinee and assumes that
a new ability is independently sampled for each item. The models differ
in the specification of the distribution of the $\theta_i$ and are defined by a
formula of form

(4)  $f(U \mid \varphi) = \prod_{i} \int P_i(t)^{u_i}\, Q_i(t)^{1 - u_i}\, dF_{\varphi}(t)$

where the definition of $\varphi$ differs from model to model. For example,
we have the standard model with $\varphi = \langle \theta \rangle$ and all the $\theta_i = \theta$, the
Gaussian model with $\varphi = \langle \theta_0, \sigma^2 \rangle$.

And finally, as a limiting case, we have the unconstrained model in which
the $\theta_i$ may be any value and

$\varphi = \langle \theta_1, \ldots, \theta_i, \ldots, \theta_n \rangle$ where $-\infty < \theta_i < \infty$.
THE INDICES

Type I: Marginal Probabilities

If the (generally unknown) density for the $\theta$'s is specified and
denoted by $g$, then the formula

(5)  $\int_{-\infty}^{\infty} f(U^* \mid \theta)\, g(\theta)\, d\theta$

can be used to obtain the marginal probability of a vector of item scores
$U^*$. The standard model specifies a particular formula for the conditional
probability $f(U^* \mid \theta)$. Our different marginal probability indices specify
different ability densities $g(\theta)$.

The density $g(\theta)$ summarizes our information about a sampled
examinee's ability before scoring the test. Suppose we choose
to ignore that information and base our ability estimate only on the
examinee's test performance. Mathematically this can be expressed by
replacing $g(\theta)$ by a density $\hat{g}(\theta)$ with a very small variance and
centered about $\hat{\theta}$, the maximum likelihood estimate of ability obtained
by maximizing $f(U^* \mid \theta)$. As the variance of $\hat{g}(\theta)$ tends to zero,
$\int f(U^* \mid \theta)\, \hat{g}(\theta)\, d\theta$ converges to $f(U^* \mid \hat{\theta})$. The logarithm of the maximum

$\ell_0(U^*) = \log f(U^* \mid \hat{\theta})$

is our representative type I index. We use it basically because it is
straightforward to calculate and works well, not because we believe
the single point distribution for $g(\theta)$ is reasonable.

Other type I (marginal probability) indices can be obtained by
estimating the ability distribution $g(\theta)$ from the observed $\hat{\theta}$
distribution or by true score methods (Lord, 1970). The integration
required to compute (5) can be intractable. A more easily computed
type I index begins with the observation that the function of $\theta$,
$\log f(U^* \mid \theta)$, is ordinarily unimodal and roughly symmetric about
$\theta = \hat{\theta}$. This suggests the second order approximation of $\log f(U^* \mid \theta)$

$\log f(U^* \mid \theta) \approx \ell_0(U^*) + \tfrac{1}{2}\, \ell_0''\, (\theta - \hat{\theta})^2$

where $\ell_0''$ is the second derivative of $\log f(U^* \mid \theta)$ evaluated at
$\theta = \hat{\theta}$. If the ability density is given by the unit normal
density, we then obtain the approximation of marginal probability

$(2\pi)^{-1/2} \int e^{\ell_0 + \frac{1}{2} \ell_0'' (\theta - \hat{\theta})^2}\, e^{-\theta^2/2}\, d\theta = e^{\ell_0}\, (1 - \ell_0'')^{-1/2} \exp\left[ \frac{\ell_0''\, \hat{\theta}^2}{2 (1 - \ell_0'')} \right]$

or equivalently, in logarithmic form,

$\ell_0 - \tfrac{1}{2} \log(1 - \ell_0'') + \frac{\ell_0''\, \hat{\theta}^2}{2 (1 - \ell_0'')}$.
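
The representative type I index, the maximized log likelihood under the standard model, can be sketched as follows. This is a minimal illustration that replaces the LOGIST estimation used in the paper with a plain grid search over ability; the grid bounds, the five item parameter triples, and the two example patterns are all hypothetical.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(u, theta, items):
    """log f(U | theta) under the standard model, formula (1)."""
    total = 0.0
    for score, (a, b, c) in zip(u, items):
        p = icc(theta, a, b, c)
        total += math.log(p) if score == 1 else math.log(1.0 - p)
    return total

def l0_index(u, items, lo=-4.0, hi=4.0, steps=800):
    """Return (l0, theta_hat): maximized log likelihood and the grid MLE."""
    best_theta, best_ll = lo, log_likelihood(u, lo, items)
    for k in range(1, steps + 1):
        theta = lo + (hi - lo) * k / steps
        ll = log_likelihood(u, theta, items)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_ll, best_theta

# Hypothetical items ordered from easy to hard (a, b, c).
ITEMS = [(1.0, b, 0.2) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
CONSISTENT = [1, 1, 1, 0, 0]   # easy items right, hard items wrong
ABERRANT = [0, 0, 1, 1, 1]     # easy items wrong, hard items right
```

Both patterns have three items correct, yet `l0_index(ABERRANT, ITEMS)[0]` is much lower than `l0_index(CONSISTENT, ITEMS)[0]`, which is the behavior the index is meant to have.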




Type II: Likelihood Ratios

In order to use a likelihood ratio as an index of aberrance, we
first maximize $f(U^* \mid \varphi)$ given in formula (4) over $\varphi$. In logarithmic
form, the likelihood ratio index is

$\max_{\varphi} \log f(U^* \mid \varphi) - \ell_0(U^*)$.

Our representative of this type of index is obtained from the
Gaussian model, where $f(U^* \mid \varphi) = f(U^* \mid \theta_0, \sigma^2)$ as given in formula (3).

Type III: Degree of Aberrance Estimate

Our best index of this type was obtained from the Gaussian model by
maximizing the probability $f(U^* \mid \theta_0, \sigma^2)$. The index $\hat{\sigma}$ is the square
root of the maximum likelihood estimate of the ability variance.
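
The type II and type III indices can be illustrated together, since both come from the same maximization of the Gaussian model likelihood. The sketch below is an assumption-laden stand-in: it uses a coarse grid search over central ability and ability standard deviation instead of the steepest descent method mentioned in the appendix, a trapezoid rule for the integral, and hypothetical item parameters.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Trapezoid nodes on [-6, 6] in standardized units with normal weights.
_N = 81
_NODES = []
for _k in range(_N):
    _z = -6.0 + 12.0 * _k / (_N - 1)
    _w = math.exp(-0.5 * _z * _z) * (0.5 if _k in (0, _N - 1) else 1.0)
    _NODES.append((_z, _w))
_WSUM = sum(w for _, w in _NODES)

def gaussian_log_likelihood(u, theta0, sigma, items):
    """log f(U | theta0, sigma^2); sigma = 0 gives the standard model."""
    total = 0.0
    for score, (a, b, c) in zip(u, items):
        if sigma == 0.0:
            p = icc(theta0, a, b, c)
        else:
            p = sum(w * icc(theta0 + sigma * z, a, b, c) for z, w in _NODES) / _WSUM
        total += math.log(p) if score == 1 else math.log(1.0 - p)
    return total

def likelihood_ratio_and_sigma(u, items):
    """Return (type II index, type III index) from a coarse grid search."""
    thetas = [-4.0 + 0.25 * k for k in range(33)]  # central abilities tried
    sigmas = [0.25 * k for k in range(13)]         # sigmas[0] = 0 is the standard model
    ll_standard = max(gaussian_log_likelihood(u, t, 0.0, items) for t in thetas)
    best_ll, sigma_hat = ll_standard, 0.0
    for sigma in sigmas[1:]:
        for theta0 in thetas:
            ll = gaussian_log_likelihood(u, theta0, sigma, items)
            if ll > best_ll:
                best_ll, sigma_hat = ll, sigma
    return best_ll - ll_standard, sigma_hat

# Hypothetical items ordered from easy to hard (a, b, c).
ITEMS = [(1.0, b, 0.2) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
CONSISTENT = [1, 1, 1, 0, 0]
ABERRANT = [0, 0, 1, 1, 1]
```

Because the standard model sits on the same grid as the Gaussian model, the log likelihood ratio returned here is never negative. In this sketch both indices are larger for the aberrant pattern, so an examinee is flagged by high values, the opposite direction from the type I index.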
THE SIMULATION

The indices were evaluated with a simulation of the Scholastic
Aptitude Test using Hambleton and Rovinelli's (1973) programs. To simulate
a "normal" candidate, first an ability $\theta$ was sampled from a normal, zero
mean, unit variance population. Then the item scores for the examinee
were simulated as a sequence of independent Bernoulli trials. The success
probability on the i-th trial is $P_i(\theta)$ as in formula (2), where the
parameters $a_i$, $b_i$, $c_i$ in the formula were obtained from Lord's
(1968) fitting of an SAT-V administration.

Examinees with varying degrees of aberrance were generated by
modifying the item scores of normal examinees. To simulate a spuriously
high examinee cheating on, say, 20% of the test, first a normal
examinee was simulated. Then 20% of the items were sampled without
replacement. The sampled items were then scored correct whether they
previously were correct or not. In this way files of candidates with
4%, 10%, 20%, and 40% aberrance were generated.

To generate a spuriously low examinee forced to guess on, say, 20% of
the test we again begin by generating a normal examinee and sampling 20% of
the items. Since the simulated test is a five-alternative multiple-choice
test, we rescore each sampled item as correct with probability 1/5 and
incorrect with probability 4/5. In this way files of spuriously low-scoring
candidates having 4%, 10%, 20%, and 40% aberrance were generated.

See Appendix I for details of the simulation and methods for finding
maximum likelihood estimates. See the discussion section for comments on
the test model and the modelling of aberrance.
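
The simulation steps described in this section can be sketched as below. This is an illustration under stated assumptions: it uses Python's standard `random` module in place of the generators named in the appendix, `Random.sample` in place of the authors' item sampling routine, and a hypothetical 50-item test instead of Lord's SAT-V parameters.

```python
import math
import random

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def simulate_normal(items, rng):
    """Sample an ability from N(0, 1), then independent Bernoulli item scores."""
    theta = rng.gauss(0.0, 1.0)
    scores = [1 if rng.random() < icc(theta, a, b, c) else 0 for a, b, c in items]
    return theta, scores

def sample_items(n_items, fraction, rng):
    """Sample a fraction of item positions without replacement."""
    return rng.sample(range(n_items), int(round(fraction * n_items)))

def make_spuriously_high(scores, fraction, rng):
    """Score the sampled items correct, whatever they were before."""
    modified = list(scores)
    for i in sample_items(len(scores), fraction, rng):
        modified[i] = 1
    return modified

def make_spuriously_low(scores, fraction, rng, n_alternatives=5):
    """Rescore the sampled items as blind guesses among the alternatives."""
    modified = list(scores)
    for i in sample_items(len(scores), fraction, rng):
        modified[i] = 1 if rng.random() < 1.0 / n_alternatives else 0
    return modified

# Hypothetical 50-item test, difficulties spread evenly from -2 to 2.
ITEMS = [(1.0, -2.0 + 4.0 * k / 49, 0.2) for k in range(50)]
```

A typical use is `theta, u = simulate_normal(ITEMS, random.Random(0))` followed by `make_spuriously_low(u, 0.20, rng)` with a second generator `rng`. Note that rescoring an item does not always change it: an item that was already correct stays correct under the high process, and a guess can reproduce the original score under the low process.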
RESULTS

The analogy between an observer in a psychophysics experiment
trying to detect a faint signal and our problem of trying to detect
aberrant candidates from equivocal patterns of item scores led us to
use ROC curves (Green and Swets, 1966) for evaluating indices. To compute
an empirical ROC curve for an index, say $\ell_0$ for concreteness, and a given
group of aberrant examinees, the index is evaluated for a sample of normal
and aberrant examinees. The sampled examinees are then ordered from lowest
to highest appropriateness score. The empirical ROC curve is the
set of points $\langle x(t), y(t) \rangle$ where

x(t) = the proportion of normal examinees with $\ell_0 < t$,
y(t) = the proportion of aberrant examinees with $\ell_0 < t$.

A random rule or a rule based on a poor appropriateness index will
give an ROC curve close to the diagonal x = y. A good appropriateness
index gives a curve well above the diagonal. The empirical curve
provides an estimate of the probability that normal candidates
will be incorrectly classified by a rule sufficiently stringent
to detect a given percent of a particular kind of aberrant examinee. For
example, suppose we choose t so that 5% of the population is classified
as aberrant. Further suppose that 10% of the population is aberrant. Then
the intersection of the curve with the line .9x + .1y = .05 gives the
proportion of aberrant examinees correctly identified and normal examinees
misclassified.
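
The construction of the empirical ROC curve, and the reading of an operating point from it, can be sketched as follows. The function names and the small tolerance used in the comparison are illustrative assumptions; the quadratic-time counting is kept for clarity, not efficiency.

```python
def empirical_roc(normal_scores, aberrant_scores):
    """Points <x(t), y(t)>: proportions of normal and aberrant examinees
    whose index value falls strictly below each cutoff t."""
    cutoffs = sorted(set(normal_scores) | set(aberrant_scores)) + [float("inf")]
    points = []
    for t in cutoffs:
        x = sum(1 for s in normal_scores if s < t) / len(normal_scores)
        y = sum(1 for s in aberrant_scores if s < t) / len(aberrant_scores)
        points.append((x, y))
    return points

def operating_point(points, base_rate, flagged):
    """First ROC point at which the flagged share of the whole population,
    (1 - base_rate) * x + base_rate * y, reaches the target `flagged`."""
    for x, y in points:
        if (1.0 - base_rate) * x + base_rate * y >= flagged - 1e-12:
            return x, y
    return points[-1]
```

With a 10% base rate and 5% of the population flagged, `operating_point(points, 0.10, 0.05)` returns the point where the curve meets the line .9x + .1y = .05: its first coordinate is the proportion of normal examinees misclassified and its second the proportion of aberrant examinees detected.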

In Figure 1, marginal probability ($\ell_0$) ROC curves are given for
the various spuriously low groups. Each curve is based on 3,000 examinees:
200 examinees with the same percent aberrance and 2,800 normal candidates.
The same normal examinees are used for all ROC curves in this and the
other figures.
Insert Figure 1 about here



Only the lower parts of the curves are relevant to our immediate 
purpose since a rule improperly classifying more than 50% of the normal
candidates is not likely to be used in aptitude testing. The curves show
that 20% aberrance is surprisingly well detected. They also show that
marginal probability does only slightly better than chance for 4% aberrance.
The expected net change in total test score for 4% aberrance turns out to
be very small, although an occasional very bright and very unlucky
candidate may be detected.

Figures 2 and 3 give ROC curves for the likelihood ratio test and
the degree of aberrance index. These curves show the same pattern as
the Figure 1 curves, at least over the lower part of the curves.

Insert Figures 2 and 3 about here

Figures 4, 5, 6 give the corresponding ROC curves for the spuriously 
high group. It can be seen that spuriously high aberrant candidates 
are more easily detected than spuriously low candidates. This is to 
be expected since the process generating spuriously low candidates 
necessarily contains a random component lacking in the spuriously high 
process. The spuriously low candidate is forced to guess, but the 
spuriously high candidate "knows" the right answer. Simulating high
spuriousness typically results in changing more item scores than 
simulating low spuriousness. 
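
The remarks about how many item scores each process changes, and about the net effect on total score, can be made concrete with expected values. The sketch below assumes each item is modified with probability equal to the aberrance fraction and uses the five-alternative guessing rate; the function name and return layout are hypothetical.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def expected_changes(theta, items, fraction, n_alternatives=5):
    """Expected effects of the two aberrance processes for ability theta.

    Returns (high_changed, low_changed, low_net):
      high_changed -- expected number of scores changed by the high process,
                      which is also its expected gain in total score
      low_changed  -- expected number of scores changed by the low process
      low_net      -- expected change in total score under the low process
    """
    guess = 1.0 / n_alternatives
    high_changed = low_changed = low_net = 0.0
    for a, b, c in items:
        p = icc(theta, a, b, c)
        high_changed += fraction * (1.0 - p)                  # wrong -> right
        low_changed += fraction * (p * (1.0 - guess) + (1.0 - p) * guess)
        low_net += fraction * (guess - p)                     # guess replaces p
    return high_changed, low_changed, low_net
```

For a 50-item test on which the examinee has probability .6 on every item, 4% aberrance gives an expected score loss of only .8 items under the low process, which illustrates why 4% aberrance is hard to detect. Under these per-item expectations the high process changes more scores than the low process exactly when the probability of a correct answer is below 1/2, so the comparison depends on the examinee's ability and the item difficulties.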



Insert Figures 4, 5, and 6 about here



We recomputed the likelihood ratio ROC curve for the 20% spuriously
low group using only those candidates with more than 10% of the item
scores actually changed. The resulting curve, computed from 102 examinees
(Figure 7), appears comparable to the spuriously high curves.



Insert Figure 7 about here 



The curious crossover in Figure 4 arises because according to the
standard model the probability that a very able examinee answers all
items correctly is nearly one. Thus if we begin with an able candidate
with item score vector $U^*$ and sample 40% of his items and make them
correct, we obtain a new vector $U^{**}$ which may have all or all
but a few very hard items right. When this happens the probability
$e^{\ell_0(U^{**})}$ will be very nearly one and frequently larger than $e^{\ell_0(U^*)}$.

The larger the proportion of sampled items the more frequently $\ell_0(U^{**})$
will be abnormally large. In fact for some large proportion of sampled
items, the $\ell_0$ ROC curve should pass, as observed, beneath the diagonal.

Since rules that improperly classify large numbers of normal candidates
cannot be used, the observed anomaly is inconsequential. Furthermore,
it does not appear with the likelihood ratio test. This is probably
attributable to the fact that the increment in $\ell_0(U^{**})$ is accompanied
by a comparable increment in the maximized log likelihood of $U^{**}$ under the
Gaussian model.
DISCUSSION

We consider our work important because it demonstrates that in
at least some cases there is internal evidence in an examinee's answer
sheet for the appropriateness of a test. We do not, however, feel
committed to our present indices or aberrance models. We might have just as
well worked with the posterior mean of $\sigma^2$ from the Gaussian model as an
aberrance index or an aberrance model in which the examinee fluctuates
between two abilities. For example, there is the aberrance model in
which the examinee has constant probability p of cheating on an item
and performing as if he has infinite ability defined by the equation

(6)  $f(U \mid \langle p, \theta \rangle) = \prod_{i} [(1 - p) P_i(\theta) + p]^{u_i}\, [(1 - p) Q_i(\theta)]^{1 - u_i}$,

     $0 < p < 1$.
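
A minimal sketch of the cheating model in equation (6) follows. It treats each item as answered correctly either by cheating, with probability p, or by the examinee's own ability, with the usual logistic probability. The function names and item parameters are illustrative assumptions.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic function, formula (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def cheating_log_likelihood(u, theta, p, items):
    """log f(U | <p, theta>) from equation (6); requires 0 <= p < 1."""
    total = 0.0
    for score, (a, b, c) in zip(u, items):
        p_i = icc(theta, a, b, c)
        if score == 1:
            total += math.log((1.0 - p) * p_i + p)          # own ability or cheating
        else:
            total += math.log((1.0 - p) * (1.0 - p_i))      # no cheating and a miss
    return total

# Hypothetical item parameters (easy, medium, hard), for illustration only.
ITEMS = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.5, 0.2)]
```

Setting `p = 0.0` recovers the standard model, so this model could be fitted and compared with the standard model in the same likelihood ratio fashion as the Gaussian model.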

The observation that item characteristic curve theory — with its 
local independence assumption — may be too rudimentary to provide an 
adequate description of the stochastic structure of the SAT is by no
means fatal to our main point, the point that answer sheets contain
internal evidence of aberrance. In fact it can be argued that departures
from a more specific model could be more easily detected.

In addition to studying other indicators and types of aberrance 
we feel that the following questions should be explored:
1. What is the effect (on aberrance indices) of using estimated
item parameters?

2. What is the effect of estimating item parameters from samples
containing aberrant examinees?

3. Can omitted and not reached items be used to increase the
power of aberrance indices?

4. Can the interrelations between various items and subtests be
incorporated in the test model and used to detect aberrance?

5. Do aberrance indices identify a relatively large proportion
of examinees in samples of candidates speaking English as a
second language, in samples of candidates with moderately high
test scores but very low socioeconomic status, in samples
of known cheaters?

These questions form a rich and fertile area for future research. 
Figures: ROC curves; horizontal axis, proportion of normal examinees (2800).
Appendix

Technical details on the computations are collected and listed below:

1. During the simulation of normal examinees a Tausworthe generator
(Whittlesey, 1968) was used to generate item scores. To obtain
Gaussian distributed abilities Pike's (1965) algorithm was applied
to numbers obtained from the Tausworthe generator.

2. During the simulation of aberrant examinees Learmonth and Lewis's
(1973) algorithm was used to generate numbers uniformly
distributed on the unit interval. To sample a proportion of
items without replacement, 1 + (number of items) x (uniformly
distributed number) was truncated to obtain an integer. This
process was repeated (with new uniformly distributed numbers)
until the desired number of items was selected. The uniformly
distributed numbers were also used to modify the item scores
of the sampled items for the spuriously low scoring aberrant
candidates. A sampled item was scored "correct" if a uniformly
distributed number was < .2.

3. To compute $\ell_0$, $\theta$ was first estimated with LOGIST (Wood,
Wingersky and Lord, 1976). Estimated $\theta$'s less than -5
were set equal to -5.

4. To compute the likelihood ratio index and $\hat{\sigma}$, the steepest
descent method in Gruvaeus and Jöreskog (1970) was used to maximize the
likelihood function for the Gaussian model. The starting point was
$\theta_0$ = the LOGIST estimated $\theta$ and $\sigma$ = .1. Only the steepest descent